Energy profile of rollback-recovery strategies in high performance computing

Authors

  • Esteban Meneses
  • Osman Sarood
  • Laxmikant V. Kalé
Abstract

Extreme-scale computing is set to provide the infrastructure for the advances and breakthroughs that will solve some of the hardest problems in science and engineering. However, resilience and energy concerns loom as two of the major challenges for machines at that scale. The number of components that will be assembled in the supercomputers plays a fundamental role in these challenges. First, a large number of parts will substantially increase the failure rate of the system compared to the failure frequency of current machines. Second, those components have to fit within the power envelope of the installation and keep the energy consumption within operational margins. Extreme-scale machines will have to incorporate fault tolerance mechanisms and honor the energy and power restrictions. Therefore, it is essential to understand how fault tolerance and energy consumption interplay. This paper presents a comparative evaluation and analysis of energy consumption in three different rollback-recovery protocols: checkpoint/restart, message logging and parallel recovery. Our experimental evaluation shows parallel recovery has the minimum execution time and energy consumption. Additionally, we present an analytical model that projects parallel recovery can reduce energy consumption more than 37% compared to checkpoint/restart at extreme scale.


Similar resources

Scalable Message-Logging Techniques for Effective Fault Tolerance in HPC Applications

An important set of challenges emerge as the High Performance Computing (HPC) community aims to reach extreme scale. Resilience and energy consumption are two of those challenges. Extreme-scale machines are expected to have a high failure frequency. This is an inevitable consequence of the mismatch between two trends. The number of components assembled in supercomputers grows exponentially. How...


Green Energy-aware task scheduling using the DVFS technique in Cloud Computing

Nowadays, energy consumption has become a critical issue in high-performance distributed computing systems, so green computing tries to reduce energy consumption, carbon footprint and CO2 emissions in high performance computing systems (HPCs) such as clusters, Grid and Cloud. Reducing energy consumption for high end computing can bring various benefits such as red...


A Survey and Performance Analysis of Checkpointing and Recovery Schemes for Mobile Computing Systems

Ruchi Tuli (Yanbu University College, Royal Commission for Jubail and Yanbu, Directorate General for Yanbu, P.O. Box 30436, Madinat Yanbu Al Sinaiyah, Kingdom of Saudi Arabia) and Parveen Kumar (Merrut Institute of Engineering and Technology, Merrut, India) ...


A Distributed and Replicated Service for Checkpoint Storage

As High Performance platforms (Clusters, Grids, etc.) continue to grow in size, the average time between failures decreases to a critical level. An efficient and reliable fault tolerance protocol plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing and especially in MPI applications. This technique reli...


A User-triggered Checkpointing Library for Computation-intensive Applications

We propose a method to incorporate coordinated checkpointing and rollback in high performance computing applications on massively parallel computers. A library allows the user to specify which data-items (including files) belong to the contents of the checkpoint, and to trigger the checkpointing in the application. The recovery-line management on the distributed disk system takes care of which ...



Journal:
  • Parallel Computing

Volume 40  Issue 

Pages  -

Publication date 2014